A DOM Tree Alignment Model for Mining Parallel Data from the Web

Authors

Lei Shi

Cheng Niu

Ming Zhou

Jianfeng Gao

Abstract

This paper presents a new web mining scheme for parallel data acquisition. Based on the Document Object Model (DOM), a web page is represented as a DOM tree. Then a DOM tree alignment model is proposed to identify the translationally equivalent texts and hyperlinks between two parallel DOM trees. By tracing the identified parallel hyperlinks, parallel web documents are recursively mined. Compared with previous mining schemes, the benchmarks show that this new mining scheme improves the mining coverage, reduces mining bandwidth, and enhances the quality of mined parallel sentences.
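The pipeline described in the abstract (represent each page as a DOM tree, align the two trees, collect paired texts and paired hyperlinks) can be illustrated with a small sketch. This is not the paper's alignment model: the abstract gives no details of it, so the sketch swaps in a plain longest-common-subsequence match on child tag names. The names `Node`, `TreeBuilder`, `align`, and `mine_pair` and the sample pages are all hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed subset, for illustration only).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """A simplified DOM node: tag name, optional href, own text, children."""

    def __init__(self, tag, href=None):
        self.tag = tag
        self.href = href
        self.text = ""
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs).get("href"))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def align(a, b, texts, links):
    """Pair up nodes a and b, then align their children by tag-name LCS.

    Assumption: a stand-in heuristic, not the model proposed in the paper.
    """
    if a.text and b.text:
        texts.append((a.text, b.text))
    if a.href and b.href:
        links.append((a.href, b.href))
    n, m = len(a.children), len(b.children)
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a.children[i].tag == b.children[j].tag:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    i = j = 0
    while i < n and j < m:
        if a.children[i].tag == b.children[j].tag:
            align(a.children[i], b.children[j], texts, links)
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1  # child of a has no counterpart in b
        else:
            j += 1  # child of b has no counterpart in a


def mine_pair(html_a, html_b):
    """Return (text pairs, hyperlink pairs) for two candidate parallel pages."""
    trees = []
    for html in (html_a, html_b):
        builder = TreeBuilder()
        builder.feed(html)
        builder.close()
        trees.append(builder.root)
    texts, links = [], []
    align(trees[0], trees[1], texts, links)
    return texts, links


# Hypothetical sample pages; the second has an extra <img> with no counterpart.
page_en = ('<div><h1>Welcome</h1><p>Our products</p>'
           '<a href="/en/about.html">About us</a></div>')
page_fr = ('<div><h1>Bienvenue</h1><img src="banner.png"><p>Nos produits</p>'
           '<a href="/fr/about.html">Qui sommes-nous</a></div>')
texts, links = mine_pair(page_en, page_fr)
```

In a recursive miner, each pair in `links` would be the next pair of pages to fetch and align, which is how following parallel hyperlinks could reach further parallel documents; fetching is left out here to keep the sketch self-contained.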

Similar resources

Improved Sentence Alignment on Parallel Web Pages Using a Stochastic Tree Alignment Model

Parallel web pages are an important source of training data for statistical machine translation. In this paper, we present a new approach to sentence alignment on parallel web pages. Parallel web pages tend to have parallel structures, and this structural correspondence can help identify parallel sentences. In our approach, the web page is represented as a tree, and a sto...

Eliminating the Noise from Web Pages using Page Replacement Algorithm

Data mining is the process of mining information from a large set of data. It has many categories, such as text mining, web usage mining, and web content mining. Many types of algorithms are used in web mining, e.g. the Visitor method, the DOM tree, and the Least Recently Used algorithm. The Visitor and DOM tree methods are complex and time consuming. The Least Recently Used algorithm is less time c...

Retrieve Information Using Improved Document Object Model Parser Tree Algorithm

Data mining refers to mining useful information from raw or unstructured data, whereas in web content mining the data is scattered or unstructured across web pages. Sometimes the user wants to retrieve only a fixed kind of data, but unwanted data is also retrieved. The unnecessary information can be removed with the proposed work. The DOM Parser Tree Algorithm to filter the web pages ...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Development of a Combined System Based on Data Mining and Semantic Web for the Diagnosis of Autism

Introduction: Autism is a nervous system disorder, and since there is no direct diagnosis for it, data mining can help diagnose the disease. Ontology as a backbone of the semantic web, a knowledge database with shareability and reusability, can be a confirmation of the correctness of disease diagnosis systems. This study aimed to provide a system for diagnosing autistic children with a combinat...

Publication date: 2006
